ABSTRACT 

Differentially private collaborative filtering is a challenging 
task, both in terms of accuracy and speed. We present a 
simple algorithm that is provably differentially private while 
offering good performance, using a novel connection of 
differential privacy to Bayesian posterior sampling via Stochastic 
Gradient Langevin Dynamics. Due to its simplicity, the 
algorithm lends itself to efficient implementation. By careful 
systems design and by exploiting the power law behavior of 
the data to maximize CPU cache bandwidth, we are able to 
generate 1024-dimensional models at a rate of 8.5 million 
recommendations per second on a single PC. 

Keywords 
1. INTRODUCTION 

Privacy protection in recommender systems is a notoriously 
challenging problem. There are often two competing 
goals at stake: similar users are likely to prefer similar 
products, movies, or locations, hence sharing of preferences 
between users is desirable. Yet, at the same time, this 
exacerbates the problem of privacy-sensitive queries, simply because 
we are no longer looking for aggregate properties of a dataset 
(such as a classifier) but for properties and behavior of other 
users ‘just like’ this specific user. Such highly individualized 
behavioral patterns have been shown to facilitate provably 
effective user de-anonymization [23, 36]. 

Consider the case of a couple, both using the same location 
recommendation service. Since both spouses share much of 
the same location history, it is likely that they will receive 
similar recommendations, based on other users’ preferences 
similar to theirs. In this context sharing of information is 
desirable, as it improves overall recommendation quality. 

Moreover, since their location history is likely to be very 
similar, each of them will also receive recommendations to 
visit the places that their spouse visited (e.g. including places 
of ill repute), regardless of whether the latter would like 
to share this information or not. This creates considerable 
tension in trying to satisfy those two conflicting goals. 

Differential privacy offers tools to overcome these problems. 
Loosely speaking, it offers the participants plausible 
deniability in terms of the estimate. That is, it provides 
guarantees that the recommendation would also have been 
issued with sufficiently high probability if another specific 
participant had not taken this action before. This is precisely 
the type of guarantee suitable to allay the concerns in 
the above situation [8]. 

Recent work, e.g. by McSherry and Mironov [18], has focused 
on designing custom-built tools for differentially private 
recommendation. Many of the design decisions in this context 
are hand engineered, and it is nontrivial to separate 
the choices made to obtain a differentially private system 
from those made to obtain a system that works well. 
Furthermore, none of these systems [18, 35] leads to very fast 
implementations. 

In this paper we show that a large family of recommender 
systems, namely those using matrix factorization, are well 
suited to differential privacy. More specifically, we exploit 
the fact that sampling from the posterior distribution of a 
Bayesian model, e.g. via Stochastic Gradient Langevin 
Dynamics (SGLD) [34], can lead to estimates that are 
sufficiently differentially private [33]. At the same time, the 
stochastic nature of such samplers makes them amenable to efficient 
implementation. Their generality means that we need not 
custom-design a statistical model for differential privacy but 
rather that it is possible to retrofit an existing model to satisfy 
these constraints. The practical importance of this fact cannot 
be overstated: it means that no costly re-engineering 
of deployed statistical models is needed. Instead, one can 
simply reuse the existing inference algorithm with a trivial 
modification to obtain a differentially private model. 

This leaves the issue of performance. Some of the best 
reported results are those using GraphChi [14], which show 
that state-of-the-art recommender systems can be built using 
just a single PC within a matter of hours, rather than 
requiring hundreds of computers. In this paper, we show 
that by efficiently exploiting the power law properties inherent 
in the data (e.g. most movies are hardly ever reviewed on 
Netflix), one can obtain models that achieve peak numerical 
performance for recommendation. More to the point, they 
are 3 times faster than GraphChi on identical hardware. 

In summary, this paper describes by far the fastest 
matrix factorization based recommender system, and it can 
be made differentially private using SGLD without losing 
performance. Most competing approaches excel at no more 
than one of those aspects. Specifically, 

1. It is efficient, at the state of the art relative to other matrix 
factorization systems. 

• We develop a cache-efficient matrix factorization framework 
for general SGD updates. 

• We develop a fast SGLD sampling algorithm with bookkeeping 
that avoids adding Gaussian noise to the whole 
parameter space at each update while still maintaining 
the correctness of the algorithm. 

2. It is differentially private. 

• We show that sampling from a scaled posterior distribution 
for a matrix factorization system can guarantee 
user-level differential privacy. 

• We present a personalized differentially private method 
for calibrating each user’s privacy and accuracy. 

• We privately release only V to the public, and design a local 
recommender system for each user. 

Experiments confirm that the algorithm can be implemented 
with high efficiency, while offering a very favorable 
privacy-accuracy tradeoff that nearly matches systems without 
differential privacy at a meaningful privacy level. 

2. BACKGROUND 

We begin with an overview of the relevant ingredients, 
namely collaborative filtering using matrix factorization, 
differential privacy and a primer in computer architecture. All 
three are relevant to the understanding of our approach. In 
particular, some basic understanding of the cache hierarchy 
in microprocessors is useful for efficient implementations. 

2.1 Collaborative Filtering 

In collaborative filtering we assume that we have a set 
$\mathcal{U}$ of users rating a set $\mathcal{V}$ of items. We only observe a 
small number of entries $r_{ij}$ in the rating matrix $R$. Here $r_{ij}$ 
is the rating that user $i$ gave item $j$. A popular tool [13] for inferring 
entries in $R \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{V}|}$ is to approximate $R$ by a low-rank 
factorization, i.e. 

$R \approx U V^\top$ where $U \in \mathbb{R}^{|\mathcal{U}| \times k}$ and $V \in \mathbb{R}^{|\mathcal{V}| \times k}$ (1) 

for some $k \in \mathbb{N}$, which denotes the dimensionality of the 
feature space corresponding to each user and movie. In 
other words, (user, item) interactions are modeled via 

$r_{ij} \approx \langle u_i, v_j \rangle + b^u_i + b^m_j + b_0$. (2) 

Here $u_i$ and $v_j$ denote row vectors of $U$ and $V$ respectively, 
and $b^u_i$ and $b^m_j$ are scalar offsets responsible for a specific 
user or movie respectively. Finally, $b_0$ is a common bias. 

A popular interpretation is that for a given item $j$, the 
elements of $v_j$ measure the extent to which the item possesses 
those attributes. For a given user $i$ the elements of $u_i$ 
measure the extent of interest that the user has in items that 
score highly in the corresponding factors. Due to the conditions 
proposed in the Netflix contest, it is common to aim 
to minimize the mean squared error of deviations between 
true ratings and estimates. To address overfitting, a norm 
penalty is commonly imposed on $U$ and $V$. This yields the 
following optimization problem 

$\min_{U,V} \sum_{(i,j) \in R} (r_{ij} - \langle u_i, v_j \rangle - b^u_i - b^m_j - b_0)^2 + \lambda(\|U\|_F^2 + \|V\|_F^2)$ 


A large number of extensions have been proposed for this 
model. For instance, incorporating co-rating information 
[27], neighborhoods, or temporal dynamics [12] can lead to 
improved performance. Since we are primarily interested 
in demonstrating the efficacy of differential privacy and the 
interaction with efficient systems design, we focus on the 
simple inner-product model with bias. 

Bayesian View. Note that the above optimization problem 
can be viewed as an instance of a Maximum-a-Posteriori 
estimation problem. That is, one minimizes 

$-\log p(U, V \mid R, \lambda_r, \Lambda_u, \Lambda_v) = -\log \mathcal{N}(R \mid \langle U, V \rangle, \lambda_r^{-1}) - \log \mathcal{N}(U \mid 0, \Lambda_u^{-1}) - \log \mathcal{N}(V \mid 0, \Lambda_v^{-1})$ 

where, up to a constant offset, 

$-\log p(r_{ij} \mid u_i, v_j) = \lambda_r (r_{ij} - \langle u_i, v_j \rangle - b^u_i - b^m_j - b_0)^2$ 

and $-\log p(U) = U \Lambda_u U^\top$ and likewise for $V$. In other 
words, we assume that the ratings are conditionally normal, 
given the inner product $\langle u_i, v_j \rangle$, and the factors $u_i$ and $v_j$ are 
drawn from a normal distribution. Moreover, one can also 
introduce priors for $\lambda_r, \Lambda_u, \Lambda_v$ with a Gamma distribution 
$\mathcal{G}(\cdot \mid \alpha, \beta)$. 

While this setting is typically just treated as an afterthought 
of penalized risk minimization, we will explicitly use this 
when designing differentially private algorithms. The rationale 
for this is the deep connection between samples from 
the posterior and differentially private estimates. We will 
return to this aspect after introducing Stochastic Gradient 
Langevin Dynamics. 

Stochastic Gradient Descent. Minimizing the regularized 
collaborative filtering objective is typically achieved by 
one of two strategies: Alternating Least Squares (ALS) and 
stochastic gradient descent (SGD). The advantage of the 
former is that the problem is biconvex in $U$ and $V$, 
hence minimizing over $U$ given $V$, or over $V$ given $U$, is a convex 
problem. On the other hand, SGD is typically faster to converge and it also 
affords much better cache locality properties. Instead of 
accessing e.g. all reviews for a given user (or all reviews for a 
given movie) at once, we only need to read the appropriate 
tuples. In SGD, each step picks a rating record at random 
and updates the corresponding factors by: 

$u_i \leftarrow (1 - \eta_t \lambda) u_i + \eta_t v_j (r_{ij} - \langle u_i, v_j \rangle - b^u_i - b^m_j - b_0)$ 

$v_j \leftarrow (1 - \eta_t \lambda) v_j + \eta_t u_i (r_{ij} - \langle u_i, v_j \rangle - b^u_i - b^m_j - b_0)$ (3) 

One problem of SGD is that trivially parallelizing the procedure 
requires memory locking and synchronization for each 
rating, which could significantly hamper the performance. 
It is shown in [25] that a lock-free scheme can achieve a nearly 
optimal solution when the data access is sparse. We build on this 
statistical property to obtain a fast system which is suitable 
for differential privacy. 
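
As an illustration, update (3) for a single rating triple can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation described in this paper: the function name `sgd_step`, the array layout, and the fixed step size are assumptions, and the biases are held constant because (3) only states the factor updates.

```python
import numpy as np

def sgd_step(U, V, b_user, b_item, b0, i, j, r_ij, eta, lam):
    """Apply update (3) for one rating triple (i, j, r_ij), in place.

    Hypothetical helper: U and V hold one factor row per user / item.
    Returns the residual that was used for the update.
    """
    u_i = U[i].copy()  # copy so both updates use the same old values
    v_j = V[j].copy()
    err = r_ij - (u_i @ v_j + b_user[i] + b_item[j] + b0)
    U[i] = (1.0 - eta * lam) * u_i + eta * err * v_j
    V[j] = (1.0 - eta * lam) * v_j + eta * err * u_i
    return err
```

Repeatedly applying the step to the same triple shrinks the residual until the regularizer balances it, which is the behavior one expects from (3).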

2.2 Differential Privacy 

Differential privacy (DP) [7, 9] aims to provide means to 
cryptographically protect personal information in the database, 
while allowing aggregate-level information to be accurately 
extracted. In our context this means that we protect 
user-specific sensitive information while using aggregate 
information to benefit all users. 

Assume the actions of a statistical database are modeled 
via a randomized algorithm $\mathcal{A}$. Let the space of data be $\mathcal{X}$ 
and data sets $X, Y \in \mathcal{X}^n$. Define $d(X, Y)$ to be the edit 
distance or Hamming distance between data sets $X$ and $Y$; 
for instance, if $X$ and $Y$ are the same except for one data point, 
then $d(X, Y) = 1$. 

Definition 1 (Differential Privacy). We call a randomized 
algorithm $\mathcal{A}$ $(\epsilon, \delta)$-differentially private if for all measurable 
sets $S \subseteq \mathrm{Range}(\mathcal{A})$ and for all $X, X' \in \mathcal{X}^n$ such that the 
Hamming distance $d(X, X') = 1$, 

$P(\mathcal{A}(X) \in S) \le \exp(\epsilon) P(\mathcal{A}(X') \in S) + \delta$. 

If $\delta = 0$ we say that $\mathcal{A}$ is $\epsilon$-differentially private. 

The definition states that if we arbitrarily replace any 
individual data point in a database, the output of the algorithm 
doesn’t change much. The parameter $\epsilon$ in the definition 
controls the maximum amount of information gain about an 
individual person in the database given the output of the 
algorithm. When $\epsilon$ is small, it prevents any form of linkage 
attack on an individual data record (e.g., linkage of Netflix data 
to IMDB data [23]). We refer readers to [8] for detailed 
interpretations of differential privacy in statistical testing, 
Bayesian inference and information theory. 

An interesting side-effect of this definition in the context 
of collaborative filtering is that it also limits the influence 
of so-called whales, i.e. of users who submit extremely large 
numbers of reviews. Their influence is also curtailed, at 
least under the assumption of an equal level of differential 
privacy per user. In other words, differential privacy confers 
robustness for collaborative filtering. 

Wang et al. [33] show that posterior sampling with bounded 
log-likelihood is essentially the exponential mechanism [19], 
therefore protecting differential privacy for free (similar 
observations were made independently in [21, 5]). Wang et al. [33] 
also suggest that a recent line of work [34, 4, 6], which uses 
stochastic gradients for Hybrid Monte Carlo sampling, 
essentially preserves differential privacy with the same algorithmic 
procedure. The consequence for our application is very 
interesting: if we trust that the MCMC sampler has converged, 
i.e. if we get a sample that is approximately drawn from the 
posterior distribution, then we can use one sample as the 
private release. If not, we can calibrate the MCMC procedure 
itself to provide differential privacy (typically at the 
cost of getting a much poorer solution). 

2.3 Computer Architecture 

A key difference between generic numerical linear algebra, 
as commonly used e.g. for deep networks or generalized linear 
models, and the methods used for recommender systems 
is the fact that the access properties regarding users and 
items are highly nonuniform. This is a significant advantage, 
since it allows us to exploit the caching hierarchy of 
modern CPUs to benefit from higher bandwidth than what 
disks or main memory access would permit. 

A typical computer architecture consists of a hard disk, 
solid-state drive (SSD), random-access memory (RAM) and 
CPU cache. Many factors affect the real available bandwidth, 
such as read and write patterns, block sizes, etc. We 
measured this for a desktop computer. See Table 1 for a 
quick overview. A good algorithm design should push 
the data flow to the CPU cache level, hide the latency of 
SSD or even RAM, and amplify the available bandwidth. 

Device      Capacity   Bandwidth read   Bandwidth write 
Hard Disk   3TB        150MB/s          100MB/s 
SSD         256GB      500MB/s          350MB/s 
RAM         16GB       14GB/s           9GB/s 
L3 Cache    6MB        16-44GB/s        7-30GB/s 
L1 Cache    32KB       74-135GB/s       44-80GB/s 

Table 1: Performance (single threaded) on a MacBook 
Pro (2011) using an Intel Core i7 operating at 
2.0 GHz and 160MT/s transfer rate and 2 memory 
banks. The spread in L1 and L3 bandwidth is due 
to different packet sizes. 

The key strategy in obtaining high throughput collaborative 
filtering systems is to obtain peak bandwidth on each of 
the subsystems by efficient caching. That is, if a movie is 
frequently reused, it is desirable to retain it in the CPU cache. 
This way, we will neither suffer the high latency (100ns per 
request) of a random read from memory, nor will we have to 
pay for the comparably slower bandwidth of RAM relative to 
the CPU cache. This intuition is confirmed in the observed 
cache miss rates reported in the experiments in Section 6. 

3. DIFFERENTIALLY PRIVATE 
MATRIX FACTORIZATION 

We start by describing the key ideas and algorithmic 
framework for differentially private matrix factorization. The 
method, which involves preprocessing data and then 
sampling from a scaled posterior distribution, is provably 
differentially private and has profound statistical implications. 
Then we will describe a specific Monte Carlo sampling 
algorithm, Stochastic Gradient Langevin Dynamics (SGLD), 
and justify its use in our setting. We then come up with 
a novel way to personalize the privacy protection for 
individual users. Finally, we discuss how to develop fast 
cache-efficient solvers to exploit bandwidth-limited hardware such 
that it can be used for general SGD-style algorithms. 

Our differential privacy mechanism relies on a recent 
observation that posterior sampling preserves differential 
privacy, provided that the log-likelihood of each user is 
uniformly bounded [33]. This simple yet remarkable result 
suggests that sampling from the posterior distribution is, to some 
extent, differentially private for free. In our context (omitting 
the biases; see footnote 1), the claim is that if 

$\max_{U,V,R,i} \sum_{j \in \Omega_i} (r_{ij} - \langle u_i, v_j \rangle)^2 \le B$, 

where $\Omega_i$ denotes the set of items rated by user $i$, then 
the method that outputs a sample from 

$P(U, V) \propto \exp\left(-\left[\sum_{(i,j) \in R} (r_{ij} - \langle u_i, v_j \rangle)^2 + \lambda(\|U\|_F^2 + \|V\|_F^2)\right]\right)$ 

preserves $4B$-differential privacy. Moreover, when we want 
to set the privacy loss $\epsilon$ to another number, we can easily 
do this by rescaling the entire exponent by $\epsilon/(4B)$. 

The question now is whether $\max_{U,V,R,i} \sum_{j \in \Omega_i} (r_{ij} - \langle u_i, v_j \rangle)^2$ 
is bounded. Since the ratings are bounded, $1 \le r_{ij} \le 5$, 
and we can consider a reasonable sublevel set $\{U, V \mid 
\max_{i,j} |u_i^\top v_j| \le k\}$, every summand is bounded 
by $(5 + k)^2$. This does not affect the privacy claim as long 
as $k$ is chosen independently of the data. 

1 For convenience of notation we will omit the biases from 
the description below in favor of a slightly more succinct 
notation. 

B could still be large if some particular users rated many 
movies. This issue is inevitable even if all observed users 
have few ratings, since differential privacy also protects users 
not in the database. We propose two theoretically inspired 
algorithmic solutions to this problem: 

Trimming: We may randomly delete ratings of those who 
rated many movies so that the maximum number of 
ratings from a single user, r, will not be much larger 
than the average number of ratings. This procedure is 
the underlying gem that allows OptSpace (the very 
first provable matrix factorization based low-rank 
matrix completion method) [ ] to work. 

Reweighting: Alternatively, one can weight each user 
appropriately so that those who rated many movies will 
have a smaller weight for each rating. McSherry and 
Mironov [18] used this reweighting scheme for controlling 
privacy loss. A similar approach is considered in 
the study of non-uniform and power-law matrix 
completion [20, 29], where the weighted trace norm has the 
same effect as if we reweight the loss functions. 

In addition, these procedures have practical benefits 
for the robustness of the recommendation system, since they 
prevent any malicious user from injecting too much impact 
into the system, see e.g., Wang and Xu [32], Mobasher et al. 
[22]. Another justification of these two procedures is that, if 
the fully observed matrix is truly in a low-dimensional 
subspace, neither of these two procedures changes the underlying 
subspace. Therefore, the solutions should be similar to 
the non-preprocessed version. 

The procedure for differentially private matrix factorization 
(DPMF) is summarized in Algorithm 1. Note that this 
is a conceptual sketch (we will discuss an efficient variant 
thereof later). The following theorem guarantees that our 
procedure is indeed differentially private. 


Algorithm 1 Differentially Private Matrix Factorization 

Require: Partially observed rating matrix $R \in \mathbb{R}^{m \times n}$ with 
observation mask $\Omega$. m = # of movies, n = # of users. 
Privacy parameter $\epsilon$, a predefined positive parameter $k$ 
such that $\{U, V \mid u_i^\top v_j \in [1 - k, 5 + k] \;\forall i, j\}$, rating 
range [1, 5], max allowable number of ratings per user $r$, 
number of ratings of each user $\{m_1, ..., m_n\}$, weight of 
each user $w$, tuning parameter $\lambda$. 

1: $B \leftarrow \max_{i=1,...,n} \min\{r, m_i\} w_i (5 - 1 + k)^2$.   ▷ Compute uniform upper bound. 
2: Trim all users with ratings > $r$. 
3: $F(U, V) := \sum_{i \in [n], j \in \Omega_i} w_i (R_{ij} - u_i^\top v_j)^2 + \lambda(\|U\|_F^2 + \|V\|_F^2)$. 
4: Sample $(U, V) \sim P(U, V) \propto e^{-\frac{\epsilon}{4B} F(U, V)}$ 
5: while $u_i^\top v_j \notin [1 - k, 5 + k]$ for some $i, j$ do 
6:     Sample $(U, V) \sim P(U, V) \propto e^{-\frac{\epsilon}{4B} F(U, V)}$ 
7: return $(U, V)$ 
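
The deterministic steps 1 to 3 of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch only: the dictionary-based data layout and the names `trim_ratings`, `uniform_bound`, `objective`, `max_ratings` (for r) and `k_slack` (for k) are assumptions made for the example, and the sampling steps 4 to 6 are left out.

```python
import numpy as np

def trim_ratings(ratings_by_user, max_ratings, rng):
    """Step 2: randomly keep at most `max_ratings` ratings per user."""
    trimmed = {}
    for user, items in ratings_by_user.items():
        items = list(items)
        if len(items) > max_ratings:
            keep = rng.choice(len(items), size=max_ratings, replace=False)
            items = [items[idx] for idx in sorted(keep)]
        trimmed[user] = items
    return trimmed

def uniform_bound(ratings_by_user, weights, max_ratings, k_slack):
    """Step 1: B = max_i min(r, m_i) * w_i * (5 - 1 + k)^2."""
    return max(
        min(max_ratings, len(items)) * weights[user] * (4.0 + k_slack) ** 2
        for user, items in ratings_by_user.items()
    )

def objective(U, V, ratings_by_user, weights, lam):
    """Step 3: weighted squared error plus Frobenius-norm penalty."""
    loss = sum(
        weights[user] * (r - U[user] @ V[item]) ** 2
        for user, items in ratings_by_user.items()
        for item, r in items
    )
    return loss + lam * (np.sum(U ** 2) + np.sum(V ** 2))
```

Here `ratings_by_user` maps a user index to a list of `(item, rating)` pairs; a sampler for steps 4 to 6 would then target a density proportional to `exp(-eps / (4 * B) * objective(...))`.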


Theorem 1. Algorithm 1 obeys $\epsilon$-differential privacy if the 
sample is exact, and $(\epsilon, (1 + e^{\epsilon})\delta)$-differential privacy if the 
sample is from a distribution $\delta$-away from the target 
distribution in $L_1$ distance. 

The proof (given in the appendix) shows that this procedure 
in fact uses the exponential mechanism [19] with the utility 
function being the negative MF objective and its sensitivity 
being $2B$. Note that this can be extended to considerably 
more complex models. This is the strength of our approach, 
namely that a large variety of algorithms can be adapted 
quite easily to differential privacy capable models. 
Statistical properties. How about the utility of this 
procedure? We argue that we do not lose much accuracy by 
sampling from a distribution instead of doing exact 
optimization. Here we define utility/accuracy to be how well 
the output predicts new data. 

Our matrix factorization formulation can be treated as 
a maximum a posteriori (MAP) estimator of Bayesian 
Probabilistic Matrix Factorization (BPMF) [26]. Therefore, 
the distribution we are sampling from is actually a scaled 
version of the posterior distribution. 

When $\epsilon = 4B$, Wang et al. [33] show that a single sample 
from the posterior distribution is consistent whenever 
the Bayesian model that gives rise to $f(\theta)$ is consistent, and 
asymptotically only a factor of 2 away from matching the 
Cramer-Rao lower bound whenever the asymptotic normality 
(Bernstein-von Mises Theorem) of the posterior distribution 
holds. Therefore, we argue that by taking only one 
sample from the posterior distribution, our results will not 
be much worse than estimating the MAP or the posterior 
mean estimator in BPMF. Moreover, since the results do not 
collapse to a point estimator, the output from this sampling 
procedure does not tend to overfit [34]. 

When $\epsilon < 4B$ we will start to lose accuracy, but since 
we are still sampling from a scaled posterior distribution, 
the same statistical property applies and the result remains 
asymptotically near optimal with asymptotic relative 
efficiency $1 + \sqrt{4B/\epsilon}$. In fact, monotonic rescaling of $U$ and $V$ 
leaves the relative order of ratings unchanged. 

3.1 Personalized Differential Privacy 

Another interesting feature of the proposed procedure is 
that it allows us to calibrate the level of privacy protection 
for every user independently, via a novel observation that 
weights assigned to different users are linear in the amount 
of privacy we can guarantee for that particular user. 

We will use the same sampling algorithm, and our 
guarantees in Theorem 1 still hold. The idea here is that we can 
customize the system so that we get a lower, baseline level of 
privacy protection for all users, say $\epsilon = 4B$. As we explained 
earlier, this is the level of privacy that we can get more or less 
“for free”. The protection of DP is sufficiently strong as to 
include even those users that are not in the database. 

By adjusting the weight parameter, we can make the 
privacy protection stronger for particular users according to 
how much privacy they request. This procedure makes 
intuitive sense because if some user wants perfect privacy, we 
can set their weight to 0 and they are effectively not in the 
database anymore. For people who do not care about 
privacy, their ratings will be assigned the default weight. Formally, 
we define personalized differential privacy as follows: 

Definition 2 (Personalized Differential Privacy). An 
algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-personalized differentially private for user 
$i$ in database $X$ if for any measurable set $S$ in the range of 
the algorithm $\mathcal{A}$ 

$P(\mathcal{A}(X) \in S) \le e^{\epsilon} P(\mathcal{A}(X') \in S) + \delta$ 

for any $X \in \mathcal{X}^n$, where $X'$ is either $X \cup \{x_i\}$ or $X \setminus \{x_i\}$. 

We claim that: 

Theorem 2. If we set $w_i$ for user $i$ such that 

$B_i := \min\{r, m_i\} w_i (4 + k)^2 \le B$, 

then Algorithm 1 guarantees personalized differential 
privacy for user $i$. 

The proof is a straightforward verification of the definition. 
We defer it to the Appendix. Note that if we set $\epsilon = 4B$ (so 
we are essentially sampling from the posterior distribution), 
we get $2B_i$-personalized DP for user $i$. 

In summary, if we simply set $\epsilon = 4B$, the method provides 
$4B$-differential privacy for everybody at very little cost, and 
by setting the weight vector $w$, we can provide personalized 
service for users who demand more stringent DP protection. 
To the best of our knowledge, this is the first method of its 
kind to protect differential privacy in a personalized fashion. 
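
To make the role of the weights concrete, the following Python sketch inverts the relation in Theorem 2 for the case $\epsilon = 4B$, where user $i$ obtains $2B_i$-personalized DP. It is a hedged illustration: the function names and the idea of passing a per-user target `target_eps` are assumptions of this example, and the caller still has to check that the resulting $B_i$ does not exceed $B$.

```python
def user_bound(weight, num_ratings, max_ratings, k_slack):
    """B_i = min(r, m_i) * w_i * (4 + k)^2, as in Theorem 2."""
    return min(max_ratings, num_ratings) * weight * (4.0 + k_slack) ** 2

def personalized_weight(target_eps, num_ratings, max_ratings, k_slack):
    """Weight w_i for which 2 * B_i equals target_eps (assumes epsilon = 4B)."""
    per_rating = (4.0 + k_slack) ** 2
    return target_eps / (2.0 * min(max_ratings, num_ratings) * per_rating)

# A user with 30 ratings, trimmed to r = 20, who asks for 2 * B_i = 1.0.
w_i = personalized_weight(target_eps=1.0, num_ratings=30, max_ratings=20, k_slack=1.0)
B_i = user_bound(w_i, num_ratings=30, max_ratings=20, k_slack=1.0)
```

Setting `target_eps` to 0 yields a weight of 0, which matches the observation above that such a user is effectively removed from the database.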

4. EFFICIENT SAMPLING VIA SGLD 

Clearly, sampling from $\exp(-\frac{\epsilon}{4B} F(U, V))$ is nontrivial. 
For a tractable approach we use a recent MCMC method 
named stochastic gradient Langevin dynamics (SGLD) [34], 
which is an annealing of stochastic gradient descent and 
Langevin dynamics that samples from the posterior 
distribution [2' ]. The basic update rule is 

$\Delta(u_i, v_j) = -\eta_t \nabla_{(u_i, v_j)} F(U, V) + \mathcal{N}(0, \eta_t I)$ (4) 

where $\nabla_{(u_i, v_j)} F(U, V)$ is a stochastic gradient computed using 
only one or a small number of ratings. In other words, 
the updates are almost identical to those used in stochastic 
gradient descent. The key difference is that a small amount 
of Gaussian noise is added to the updates. This allows us to 
solve it extremely efficiently. We will describe our efficient 
implementation of this algorithm in Section 5.4. 

The basic idea of SGLD is that when we are far away from 
the basin of convergence, the gradient of the log-posterior 
$\nabla_{(u_i, v_j)} F(U, V)$ is much larger than the additional noise, so 
the algorithm behaves like stochastic gradient descent. As 
we approach the basin of convergence and $\eta_t$ becomes small, 
$\sqrt{\eta_t} \gg \eta_t$, so the noise dominates and it behaves like a 
Brownian motion. Moreover, as $\eta_t$ gets small, the probability of 
accepting the proposal in a Metropolis-Hastings adjustment 
converges to 1, so we do not need to do this adjustment at 
all as the algorithm proceeds, as designed above. 
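
A single noisy update of this kind can be sketched as below. The sketch makes several assumptions that are not taken from the text: the gradient is computed from one rating without biases, `scale` stands for the factor $\epsilon/(4B)$ applied to $F$, and the injected noise has variance equal to the step size `eta`.

```python
import numpy as np

def sgld_step(u_i, v_j, r_ij, eta, lam, scale, rng):
    """One SGLD-style update for the pair (u_i, v_j); returns new vectors.

    Assumption of this sketch: noise variance equals the step size eta.
    """
    err = r_ij - u_i @ v_j
    # Gradient of scale * [(r_ij - <u_i, v_j>)^2 + lam * (|u_i|^2 + |v_j|^2)].
    grad_u = scale * (-2.0 * err * v_j + 2.0 * lam * u_i)
    grad_v = scale * (-2.0 * err * u_i + 2.0 * lam * v_j)
    u_new = u_i - eta * grad_u + np.sqrt(eta) * rng.standard_normal(u_i.shape)
    v_new = v_j - eta * grad_v + np.sqrt(eta) * rng.standard_normal(v_j.shape)
    return u_new, v_new
```

With a large step the gradient term of order `eta` dominates, while for small `eta` the noise of order `sqrt(eta)` takes over, mirroring the two regimes described above.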

This seemingly heuristic procedure was later shown to be 
consistent in [28, 30], where asymptotic “in-law” and “almost 
sure” convergence of SGLD to the correct stationary 
distribution are established. More recently, Teh et al. [31] further 
strengthen the convergence guarantee to cover any finite 
number of iterations. This line of work justifies our approach in that if 
we run SGLD for a large number of iterations, we will end 
up sampling from the distribution that provides us 
$(\epsilon, \delta)$-differential privacy. By taking more iterations, we can make 
$\delta$ arbitrarily small. 

5. SYSTEM DESIGN 

The performance improvement over existing libraries such 
as GraphChi is due to cache-efficient design, prefetching, 
pipelining, exploiting the power law property 
of the data, and judicious optimization of random 
number generation. This leads to a system that comfortably 
surpasses even moderately optimized GPU codes. 


We primarily focus on the Stochastic Gradient Descent 
solver and subsequently we provide some details on how to 
extend this to SGLD. Inference requires a very large number 
of the following operations on data: 

• Read a rating triple $(i, j, r_{ij})$, possibly from disk, unless 
the data is sufficiently tiny to fit into RAM. 

• For each given pair $(i, j)$ of users and items fetch the 
vectors $u_i$ and $v_j$ from memory. 

• Compute the inner product $\langle u_i, v_j \rangle$ on the CPU. 

• Update $u_i, v_j$ and write their new values to RAM. 

To illustrate the impact of these operations, consider training 
a 2,048-dimensional model on the $10^8$ rating triples of 
Netflix. Per iteration this requires over 3.2TB of read/write 
operations to RAM. At a main memory bandwidth of 20GB/s 
and a latency of 100ns for each of the 200 million cache 
misses, each pass would take over 6 minutes. Instead, our 
code accomplishes this task in approximately 10 seconds by 
using the steps outlined below. 

5.1 Processing Pipeline 

To deal with the dataflow from disk to CPU, we use a 
pipelined design, decomposing global and local state akin 
to [ ]. This means that we process users sequentially, thus 
reducing the retrieval cost per user, since the operations are 
amortized over all of their ratings. This effectively halves 
IO. Moreover, since the data cannot be assumed to fit into 
RAM, we pipeline reads from disk. This hides latency and 
avoids stalling the CPUs. The writer thread periodically 
snapshots the model, i.e. U and V to disk. 

Note that for personalized recommender systems that 
require considerable personalized hidden state, such as topic 
models, or autoregressive processes, we may want to write a 
snapshot of the user-specific data, too. 


Algorithm 2 Cache efficient Stochastic Gradient Descent 

Require: parameters $U$, $V$; ratings $R$; $P$ threads 

 1: preprocessing: Split $R$ into $B$ blocks 
 2: procedure Read   ▷ Keep pipeline filled 
 3:     while # blocks in flight < $P$ do 
 4:         Read: block $b$ from disk 
 5:         Sync: notify Update about $b$ 
 6: procedure Update   ▷ Update $U$, $V$ 
 7:     while at least one of $P$ processors is available do 
 8:         Sync: receive a new block $b$ from Read 
 9:         for user $i$ in $b$ do 
10:             for each rating $r_{ij} \in b$ from user $i$ do 
11:                 Prefetch next movie factor $v_{j+1}$ from data stream 
12:                 $u_i \leftarrow u_i - \eta_t \nabla_{u_i}$ 
13:                 $v_j \leftarrow v_j - \eta_t \nabla_{v_j}$ 
14:                 ($\nabla$ is either the exact or private gradient) 
15: procedure Write 
16:     if $B_t$ blocks processed then save $U$, $V$ 
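
The Read/Update hand-off of Algorithm 2 can be illustrated with a bounded queue. The sketch below is a simplification under stated assumptions: it uses a single updater instead of $P$ threads, an in-memory list stands in for blocks on disk, prefetching and the Write procedure are omitted, and the name `run_pipeline` and the block layout are hypothetical.

```python
import queue
import threading
import numpy as np

def run_pipeline(blocks, U, V, eta, lam, max_in_flight=2):
    """Reader thread feeds rating blocks to an updater via a bounded queue."""
    q = queue.Queue(maxsize=max_in_flight)  # caps the blocks in flight

    def reader():
        for block in blocks:   # stands in for reading a block from disk
            q.put(block)       # waits while the pipeline is full
        q.put(None)            # sentinel: no more blocks

    thread = threading.Thread(target=reader)
    thread.start()
    processed = 0
    while True:
        block = q.get()
        if block is None:
            break
        for i, user_ratings in block:       # ratings are grouped by user
            for j, r_ij in user_ratings:
                u_i, v_j = U[i].copy(), V[j].copy()
                err = r_ij - u_i @ v_j
                U[i] = (1.0 - eta * lam) * u_i + eta * err * v_j
                V[j] = (1.0 - eta * lam) * v_j + eta * err * u_i
        processed += 1
    thread.join()
    return processed
```

Each block is a list of `(user, [(item, rating), ...])` pairs, so the user vector is touched once per user rather than once per rating.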


5.2 Cache Efficiency 

The previous reasoning discussed how to keep the data 
pipeline filled and how to reduce user-specific cache misses 
by preaggregating them on disk. Next we need to address 
cache efficiency with regard to movies. More to the point, 
we need to exploit cache locality relative to the CPU core 
rather than simply avoiding cache misses. The basic idea is 
that each CPU core reads exactly one cache line (commonly 64 
bytes) from RAM at a time, so algorithm designers should 
make full use of a cache line before it is evicted. 

We exploit the fact that movie ratings follow a power law 
[10], as is evident e.g. on Netflix in Figure 1. This means 
that if we succeed at keeping frequently rated movies in the 
CPU cache, we should see substantial speedups. Note that 
traditional matrix blocking tricks, as widely used for matrix 
multiplication operations, are not useful, due to the sparsity 
of the rating matrix R. Instead, we decompose the movies 
into tiers of popularity. To illustrate, consider a 
decomposition into three blocks consisting of the Top 500, the Next 
4000, and the remaining long tail. 
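
A decomposition of this kind can be computed from the rating counts alone. The following Python sketch is an illustration under assumptions: the function name `popularity_tiers` and its arguments are hypothetical, and the default tier sizes simply mirror the Top 500 / Next 4000 example above.

```python
import numpy as np

def popularity_tiers(item_ids, num_items, sizes=(500, 4000)):
    """Split items into tiers by rating count: top, next, and long tail.

    item_ids: integer array with one item index per observed rating.
    Returns the tiers and the share of all ratings that falls in each.
    """
    counts = np.bincount(item_ids, minlength=num_items)
    order = np.argsort(-counts, kind="stable")   # most-rated items first
    tiers = np.split(order, np.cumsum(sizes))    # [top, next, long tail]
    share = [counts[t].sum() / counts.sum() for t in tiers]
    return tiers, share
```

The returned shares indicate how many rating updates would hit each tier, which is the quantity that determines how much of the workload can be served from a given cache level.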

Within each block, we process a batch of users simultaneously. 
This way we can preserve the associated user vectors 
$u_i$ in cache and we are likely to cache the movie vectors, too 
(in particular for the Top 500 block). Also, parallelizing all 
the updates for multiple users does not require locks. Movie 
parameters are updated in a Hogwild fashion [25]. 

This design is particularly efficient for low-dimensional 
models since the Top 500 block fits into L1 cache (this 
amounts to 44% of all movie ratings in the Netflix dataset), 
the Next 4000 fits into L2, and ratings will typically reside 
in L3. Even in the extreme case of 2048 dimensions we can 
fit about 55% of all ratings into cache, albeit L3 cache. 

5.3 Latency Hiding and Prefetching 

To avoid the penalty for random requests we perform 
latency hiding by prefetching. That is, we actively request 
$v_j$ in advance before the rating $r_{ij}$ is to be updated. For 
dimensions less than 256, accurate prefetching leads to a 
dataflow of $v_j$ into L1 cache. Beyond that, the size of the 
latent variables could be too big to benefit from the lowest 
level of caching due to the limited size of caches in modern 
computers. We provide a detailed caching analysis in Section 6 
to illustrate the effect of these techniques. 

5.4 Optimizations for SGLD 

The data flow of SGLD is almost analogous to that in
SGD, albeit with a number of complications. First off, note
that (4) applies to the whole parameter matrices U, V rather
than just to a single vector. Following [3] we can derive an
unbiased approximation of $\nabla U_i$ in (4) which is nonzero
only for $(U_i, V_j)$ as follows:

$$\nabla u_i = \frac{N}{N_i}\, \lambda_r \left( r_{ij} - \langle u_i, v_j \rangle \right) v_j + \Lambda_u u_i$$

where $N$ and $N_i$ denote the total number of ratings and the
number of ratings by user $i$, respectively. The parameters
$\lambda_r, \Lambda_u, \Lambda_v$ do not incur any major cost:
$\Lambda_u, \Lambda_v$ are diagonal matrices with a Gamma
distribution over them, and we simply perform Gibbs sampling
once per round. However, the most time-consuming part is to
sample the remaining vectors, i.e. $P(U, V \mid R, \text{rest})$,
since it requires dense updates and, moreover, many random
numbers, which adds nontrivial cost.

Figure 1: Distribution of items (Movies/Music
pieces) as a function of their number of ratings.
Many movies have 100 ratings or less, while the majority
of ratings focuses on a small number of movies.

Dense Updates: Note that unless we encounter the triple
$(i, j, r_{ij})$ all other parameters are only updated by
adding Gaussian noise. This means that by keeping track
of when a parameter was last updated, we can simply
aggregate the updates (the Normal distribution is
closed under addition). That is, $c_i$ subsequent additions
amount to a single draw from $\mathcal{N}(0, c_i \eta)$. This
is possible since we only need to know the values of
$u_i, v_j$ whenever we encounter a new triple.

Table Lookup: Drawing iid samples from a Gaussian is
quite costly, easily dominating all other floating point
operations combined. We address this by pre-generating
a large table of numbers [17] and then performing
random lookups within the table. More to the point,
a lookup table of $r$ random numbers is statistically
indistinguishable from the truth until we draw $O(r^2)$
samples from it (this follows from the slow rate of
convergence for two-sample tests), hence a few MB of data
suffice. Finally, for cache efficiency, we read contiguous
segments with random offset (this adds a small
amount of dependence which is easily addressed by
using a larger table).

A cautionary note is that the impact of this approach
on privacy, namely how it affects the stationary
distribution of the SGLD, is unknown. In our experiments,
the results are indistinguishable for any moderately
sized finite lookup table (see our experiments in
Section 6.4).
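
The two optimizations above can be sketched together as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `make_gaussian_table`, `table_segment` and `LazyNoise` are hypothetical, and the per-step noise variance is assumed constant over the skipped steps (with a decaying step size one would sum the per-step variances instead).

```python
import math
import random

def make_gaussian_table(size, rng):
    """Pre-generate `size` standard normal samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

def table_segment(table, length, rng):
    """Return `length` contiguous table entries from a random offset."""
    start = rng.randrange(len(table) - length + 1)
    return table[start:start + length]

class LazyNoise:
    """Defer pure-noise updates until a vector is actually needed."""

    def __init__(self, num_vectors, table, rng):
        self.last_step = [0] * num_vectors  # step of the last update
        self.table = table
        self.rng = rng

    def catch_up(self, vec, index, step, variance_per_step):
        """Apply the noise skipped since `vec` was last touched.

        c skipped draws from N(0, s) sum to one draw from N(0, c * s),
        so a single scaled table segment replaces c separate draws.
        Returns c, the number of aggregated steps.
        """
        c = step - self.last_step[index]
        if c > 0:
            scale = math.sqrt(c * variance_per_step)
            noise = table_segment(self.table, len(vec), self.rng)
            for k, z in enumerate(noise):
                vec[k] += scale * z
        self.last_step[index] = step
        return c
```

In use, `catch_up` would be called on $u_i$ and $v_j$ right before the gradient step for a triple $(i, j, r_{ij})$, so that untouched vectors never incur per-step work.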

6. EXPERIMENTS AND DISCUSSION 



Figure 3: Throughput on Yahoo over different num¬ 
ber of cores with dimension 2048. 

We now investigate the efficiency and accuracy of our
fast SGD solver and Stochastic Gradient Langevin Dynamics
solver, compared with state-of-the-art available
recommenders. We also explore the accuracy of our proposed
differentially private method under varying privacy budgets.

6.1 Comparisons 

We compare the performance of both the SGD solver and 
the SGLD solver to other publicly available recommenders 
and one closed-source solver. In particular, we compare to 
both CPU and GPU solvers, since the latter tend to excel 
in massively parallel floating point operations. 

GraphChi Most of our experiments focus on a direct
comparison to GraphChi [14]. This is primarily due to the
fact that the code for GraphChi is publicly available
as open source, and to its very good performance.

GraphLab Create is a closed source data analysis
platform [16]. It is currently the fastest recommender
system available, being slightly faster than GraphChi.
We compared our system to GraphLab Create, albeit
without the fine-grained diagnostics that were possible for
GraphChi.

BidMach is a GPU based system [37]. It reports runtimes
of 90, 129 and 600 seconds respectively for 100, 200
and 500 dimensions using an Amazon g2.2xlarge
instance for the Netflix dataset (footnote 2). This is
slower than the runtimes of 48, 63, and 83 seconds for
128, 256, and 512 dimensions that we achieve without GPU
optimization on a c3.8xlarge instance.

Figure 4: Size of Gaussian lookup table vs. test
RMSE on Netflix data with dimension 16.

Spark is a distributed system (Spark MLlib) for inferring
recommendations and factorization. In a recent comparison
the argument has been made that it is somewhat
slower than GraphLab (footnote 3) while being substantially
faster than Mahout.


6.2 Data 

We use two datasets. The first is the well-known Netflix
Prize dataset, consisting of a training set of 99M ratings
spanning 480k customers and their ratings on almost 18k
movies, each movie being rated on a scale of 1 to 5 stars.
Additionally, we use the released validation set, which
consists of 1.4M ratings, for validation purposes.

Secondly, we use the Yahoo music recommender dataset,
consisting of almost 263M ratings of 635k music items by
1M users. We also use the released validation set, which
consists of 6M ratings, for validation. We rescale each
rating to a scale of 0 to 5. We compare performance on both
datasets since their sampling strategies are somewhat
incomparable (e.g. Netflix has considerable covariate shift in
the test dataset). Moreover, this larger dataset poses further
challenges for cache efficiency due to the larger number
of items to be recommended.


2 http://github.com/BIDData/BIDMach/wiki/Benchmarks 

3 http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf, Slide 31

Figure 2: Runtime comparisons of the C-SGD solver and the differentially private SGLD solver vs. non-private
GraphChi/GraphLab on identical hardware, an Amazon AWS c3.8xlarge instance. Note that regardless of the
dimensionality of the factors (512, 2048) our C-SGD is approximately 2-3 times faster than GraphChi, and
differentially private SGLD is also comparable with GraphChi in very high dimension (Top: Netflix,
Bottom: Yahoo).


6.3 Runtime 

For efficient computation, GraphChi first needs to
preprocess data into shards using the proposed parallel
sliding windows [14]. Once the data is partitioned, it can
process the graphs efficiently. For comparison, we partition
the rating matrices of both the Netflix Prize data and the
Yahoo Music data into blocks, each of which contains all the
ratings from around 1000 users. Our algorithms read one
block from disk at a time. For GraphChi and GraphLab Create
we use the default partition strategy. We run all the
experiments on an Amazon c3.8xlarge instance running
Ubuntu 14.04 with 32 CPUs and 60GB RAM.

For SGD-based methods we initialize the learning
rate and regularizer as $\eta_0 = 0.02$, $\lambda = 5 \cdot 10^{-2}$ for Netflix data,
and $\eta_0 = \{0.1, 0.08, 0.06\}$, $\lambda = 5 \cdot 10^{-2}$ for Yahoo Music data.
We update the learning rate per round as $\eta_t = \eta_0 / t^{\gamma}$. We also
use the same decay rate $\gamma = 1$ for both datasets. For our
fast SGLD solver, we set $\eta_0 = \{2 \cdot 10^{-10}, 1 \cdot 10^{-10}, 9 \cdot 10^{-11}\}$
and hyperparameters $\alpha = 1.0$, $\beta = 100.0$. We set the decay
rate $\gamma = 0.6$ for Netflix data and $\gamma = \{0.8, 0.9\}$ for Yahoo
data. In practice, to speed up SGLD's burn-in procedure,
we multiply the learning rate by a temperature parameter $\zeta$ [ ]
in the Gaussian noise $\mathcal{N}(0, \zeta \cdot \eta_t)$, i.e. noise with
standard deviation $\sqrt{\zeta \cdot \eta_t}$. We set
$\zeta = \{0.07, 0.9\}$ for Netflix data and Yahoo data.
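
The step size schedule and the tempered noise scale can be sketched as follows. This is a minimal sketch: the function names are hypothetical, rounds are assumed to be counted from $t = 1$, and the second argument of the Normal distribution is read as a variance, so the noise standard deviation is the square root of temperature times step size.

```python
import math

def learning_rate(eta0, t, gamma):
    """Polynomially decaying step size eta_t = eta0 / t**gamma, t >= 1."""
    if t < 1:
        raise ValueError("rounds are counted from 1")
    return eta0 / (t ** gamma)

def tempered_noise_std(eta_t, zeta):
    """Std. dev. of noise from N(0, zeta * eta_t), read as a variance."""
    return math.sqrt(zeta * eta_t)

# Netflix SGD setting quoted above: eta0 = 0.02, gamma = 1.
sgd_rates = [learning_rate(0.02, t, 1.0) for t in range(1, 5)]
```

With $\gamma = 1$ the step size simply shrinks as $\eta_0 / t$, whereas the SGLD settings ($\gamma = 0.6$ to $0.9$) decay more slowly.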

Since it is nontrivial to observe the test RMSE in
each epoch when using GraphLab Create, we only report
the timing of GraphLab Create and all other methods in
Figure 5. Note that we were unable to obtain performance
results from BidMach for the Yahoo dataset, since Scala
encountered memory management issues. However, we have
no reason to believe that the results would be in any way
more favorable to BidMach than the findings on the Netflix
dataset. For reproducibility the experiments were carried
out on an AWS g2.8xlarge instance.

To illustrate convergence over time, we run all
methods for a fixed number of epochs, namely 15 epochs on
Netflix and 30 epochs on Yahoo, because we observe that our
SGD solver has converged by that point. Figure 2
shows our timing results along with convergence as we
vary the dimensions of the models.

Both of our solvers, i.e. C-SGD and Fast SGLD, benefit
from our caching algorithm. C-SGD is around 2 to 3 times
faster than GraphChi and GraphLab while simultaneously
outperforming the accuracy of GraphChi. The primary
reason for the discrepancy in performance can be found in the
order in which GraphChi processes data: it partitions data
(both users and items) into random subsets and then
optimizes only over one such subblock at a time. While the
latter is fast, it negatively affects convergence, as can be
seen in Figure 2.

Note that the algorithm required for Fast SGLD is rather
more complex, since it performs sampling from the Bayesian
posterior. Consequently, it is slower than plain SGD.
Nonetheless, its speed is comparable to GraphChi in terms of
throughput (despite the latter solving a much simpler
problem). One problem of SGLD is that the more complex the
models are, the worse its convergence becomes, due to the
fact that we are sampling from a large state space. This is
possibly due to the slow mixing of SGLD, which is a known
problem [2]. Improving the mixing rate by considering a more
advanced stochastic differential equation based sampler, e.g.
[4, 6], while keeping the cache efficiency during the updates
will be important future work. To the best of our knowledge
we are the first to report the convergence results of SGLD at
this scale.


6.4 Convergence 

As described above, the convergence behavior of SGLD and
of SGD based methods is quite different. We illustrate the
convergence on a small dimension in Figure 6. Basically,
C-SGD finds a MAP estimate within several rounds and
then begins to overfit, while SGLD first needs to burn in
and only then starts its sampling procedure. Note that SGLD
can converge very fast in this case. For higher dimensions,
however, SGLD is slower to converge. Careful tuning of the
learning rate is critical here.

We also investigated the accuracy of the model as a
function of the size of the Gaussian lookup table. That is, we
checked whether replacing explicit access to samples from
the Normal distribution by looking up a consecutive number
of precomputed parameters from memory is valid. As can
be seen in Figure 4, for all but the smallest sets, this suffices.
That is, once we have more than 10,000 numbers,
we no longer need a Gaussian random number generator and
the results obtained are essentially indistinguishable
(obviously for large numbers of dimensions somewhat more
terms are needed).

6.5 Cache-efficient Design 

We show the cache efficiency of C-SGD and GraphChi in
this section. Our data access pattern can accelerate
hardware cache prefetching. In addition, we use
software prefetching strategies to prefetch movie factors in
advance. However, software prefetching is delicate to
implement in practice because we need to know the
prefetching stride in advance, that is, when to
prefetch those movie factors. In our experiments we set the
prefetching stride to 2 empirically. We set up the experiments
as follows. In each gradient update step, given $r_{ij}$, once
the parameters, e.g. $u_i$ and $v_j$ in (3), have been read they
will stay in cache for a while until they are evicted by new
parameters. What we really care about in this section is
whether each parameter is already in cache the first time it
is read by the CPU. If it is not, there will be a cache miss,
which forces the CPU to idle. After that,
the succeeding updates (the specific updates depend on the
algorithm, e.g. SGD or SGLD) for $u_i$ and $v_j$ will run at
cache level.

Figure 5: Timing comparisons on Netflix (top, 15
epochs) and Yahoo (bottom, 30 epochs).

Figure 6: Convergence of SGLD for 16-dimensional
models on Netflix. It is clear that SGLD does not
overfit (although this is not a substantial issue for
16 dimensions).

          C-SGD                  GraphChi
K         L1 Cache   L3 Cache    L1 Cache   L3 Cache
16        2.84%      0.43%       12.77%     2.21%
256       2.85%      0.50%       12.89%     2.34%
2048      3.3%       1.7%        15%        9.8%

Table 2: Cache miss rates in C-SGD and GraphChi.
The results were obtained using Cachegrind. The
cache miss rate in GraphChi is considerably higher,
which explains to some extent the speed difference.

We use Cachegrind [15] as a cache profiler and analyze
cache misses for this purpose. The result in Table 2 shows
that our algorithm is quite cache friendly when compared
with GraphChi on all dimensions. This is likely due to the
way GraphChi ingests data: it traverses one data and item
block at a time. As a result it has a less favorable
distribution of access frequencies and it needs to fetch data
from memory more frequently. We believe this to be the root
cause of both the decreased computational efficiency and the
slower convergence in the code.

6.6 Privacy and Accuracy 

We now investigate the influence of privacy loss on
accuracy. As discussed previously, a small rescaling factor
B can help us to get a nice bound on the loss function.
For private collaborative filtering purposes, we first trim
the training data by setting each user's maximum allowable
number of ratings to $r = 100$ and $r = 200$ for the Netflix
competition dataset and Yahoo Music data respectively.
We set $B = r(5 - 1 + \kappa)^2$ and the weight of each user as
$w_i = \min\left(\rho, \frac{B}{m_i (5 - 1 + \kappa)^2}\right)$, where $\kappa$ is set to 1. According to
the different trimming strengths we have B = 2500 and B = 5000
for Netflix data and Yahoo data respectively. Note that a
maximum allowable number of ratings between 100 and 200 is
quite reasonable, since in practice most users rate quite a
bit fewer than 200 movies (due to the power law nature of the
rating distribution). Moreover, for users who have more than
200 ratings, we can actually get quite a good approximation
of their profiles by only using a reasonably sized random
sample of these ratings. As such we get a dataset with 33M
ratings for Netflix and 100M ratings for Yahoo Music data.
We study the prediction accuracy, i.e. the utility of our
private method, by varying the differential privacy budget
$\epsilon$ for fixed model dimensionality K = 16.
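
The trimming and reweighting step can be sketched as follows. This is a sketch under stated assumptions: the function names are hypothetical, ratings are assumed to lie on a 1 to 5 scale, and the weight formula $w_i = \min(\rho, B / (m_i (5 - 1 + \kappa)^2))$ is a reading of the partially garbled expression in the text, with $m_i$ taken to be the number of ratings of user $i$.

```python
import random

def sensitivity_bound(max_ratings, kappa=1.0, r_min=1.0, r_max=5.0):
    """B = r * (5 - 1 + kappa)**2 for ratings on a 1..5 scale."""
    return max_ratings * (r_max - r_min + kappa) ** 2

def user_weight(num_ratings, rho, bound, kappa=1.0, r_min=1.0, r_max=5.0):
    """Assumed form: w_i = min(rho, B / (m_i * (5 - 1 + kappa)**2))."""
    return min(rho, bound / (num_ratings * (r_max - r_min + kappa) ** 2))

def trim_user(ratings, max_ratings, rng):
    """Keep at most `max_ratings` ratings, sampled uniformly at random."""
    if len(ratings) <= max_ratings:
        return list(ratings)
    return rng.sample(ratings, max_ratings)
```

With $\kappa = 1$, `sensitivity_bound(100)` gives 2500 and `sensitivity_bound(200)` gives 5000, matching the values of B stated for Netflix and Yahoo.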

The parameters of the experiment are set as follows. For
Netflix data, we set $\eta_0 = \{6 \cdot 10^{-10}, 3 \cdot 10^{-9}, 3.2 \cdot 10^{-8}\}$,
$\gamma = 0.6$, $\zeta = \{7 \cdot 10^{-2}, 2.5 \cdot 10^{-3}\}$, $\rho = \{1, 10\}$. For Yahoo
data, we set $\eta_0 = \{1.5 \cdot 10^{-10}, 1.5 \cdot 10^{-9}, 5 \cdot 10^{-10}, 2 \cdot 10^{-9}\}$,
$\gamma = \{0.8, 0.9\}$, $\zeta = \{0.05, 0.01, 0.005\}$, $\rho = \{1, 30\}$.
In addition, because we are sampling $P(U, V \mid \text{rest})$ we fix the
regularizer parameters $\Lambda_u, \Lambda_v$, which are estimated by a
non-private SGLD in this section.

While we are sampling (U, V) jointly, we essentially only
need to release V. Users can then apply their own data to
get the full model and have a local recommender system:

$$u_i \approx \left( \lambda I + \sum_{j \mid (i,j) \in S} v_j v_j^\top \right)^{-1} \sum_{j \mid (i,j) \in S} v_j r_{ij} \qquad (5)$$

The local predictions, i.e. in our context the utility of the
differentially private matrix factorization method, for
different values of the privacy loss $\epsilon$ are shown in
Figure 7.

More specifically, the model (5) is a two-stage procedure
which first takes the differentially private item vectors and
then uses the latter to obtain locally non-private user
parameter estimates. This is perfectly admissible since users
have no expectation of privacy with regard to their own ratings.
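
The local step in (5) amounts to a ridge regression of a user's own ratings on the released item vectors. It can be sketched as follows in plain Python. The names `solve_linear` and `local_user_vector` are hypothetical, the regularizer is assumed to be a scalar `lam` times the identity, and the small Gaussian elimination routine stands in for whatever linear solver an implementation would actually use.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def local_user_vector(item_vectors, ratings, lam):
    """u = (lam * I + sum_j v_j v_j^T)^{-1} sum_j v_j r_j.

    item_vectors: released vectors v_j of the items this user rated.
    ratings: the user's own ratings r_j, in the same order.
    """
    k = len(item_vectors[0])
    gram = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
    rhs = [0.0] * k
    for v, r in zip(item_vectors, ratings):
        for p in range(k):
            rhs[p] += v[p] * r
            for q in range(k):
                gram[p][q] += v[p] * v[q]
    return solve_linear(gram, rhs)
```

Because only the user's own ratings and the already released item vectors enter this computation, it can run entirely on the user's device.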

6.7 Rating privacy, user privacy and average 
personalized privacy 

Interpreting the privacy guarantees can be subtle. A
privacy loss of $\epsilon = 250$ as in Figure 7 may seem completely
meaningless by Definition 1 and the corresponding results
in McSherry and Mironov [18] may appear much better.

We first address the comparison to McSherry and Mironov
[18]. It is important to point out that our privacy loss
$\epsilon$ is stated in terms of user level privacy while the
results in McSherry and Mironov [18] are stated in terms of
rating level privacy, which offers exponentially weaker
protection: $\epsilon$-user differential privacy translates into
$\epsilon/r$-rating differential privacy. Since r = 200 in our
case, our results suggest that we lose almost no accuracy at
all while preserving rating differential privacy with
$\epsilon < 1$. This matches (and slightly improves on) the
carefully engineered system of McSherry and Mironov [18].

Figure 7: Test RMSE vs. privacy loss $\epsilon$ on Netflix
(top) and Yahoo (bottom). A modest decrease in
accuracy affords a useful gain in privacy.

On the other hand, we note that the plain privacy loss
can be a very deceptive measure of its practical level of
protection. Definition 1 protects the privacy of an arbitrary
user, who can be a malicious spammer that rates every movie in
a completely opposite fashion to what the learned model
would predict. This is a truly paranoid requirement, and
arguably not the right one, since we probably should not
protect these malicious users to begin with. For an average
user, the personalized privacy (Definition 2) guarantee can
be much stronger, as the posterior distribution concentrates
around models that predict reasonably well for such users.
As a result, the log-likelihood associated with these users
will be bounded by a much smaller number with high
probability. In the example shown in Figure 7, a typical user's
personal privacy loss is about $\epsilon/25$, which helps to reduce
the essential privacy loss to a meaningful range.
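
The two conversions discussed in this section reduce to simple arithmetic, sketched here. The function names are hypothetical; the divisor r comes from the trimming threshold and the factor of 25 is the figure quoted above for the example in Figure 7, not a general constant.

```python
def rating_level_epsilon(user_epsilon, max_ratings):
    """Rating-level privacy loss implied by user-level epsilon: eps / r."""
    return user_epsilon / max_ratings

def typical_personal_epsilon(user_epsilon, reduction=25.0):
    """Personalized loss for a typical user, using the quoted eps / 25."""
    return user_epsilon / reduction

# User-level epsilon = 250 with r = 200 ratings per user.
per_rating = rating_level_epsilon(250.0, 200)
per_typical_user = typical_personal_epsilon(250.0)
```

For instance, a user-level loss of 250 with r = 200 corresponds to a rating-level loss of 1.25, and any user-level loss below 200 corresponds to a rating-level loss below 1.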

7. CONCLUSION 

In this paper we described an algorithm for efficient
collaborative filtering that is compatible with differential
privacy. In particular, we showed that it is possible to
accomplish all three goals, accuracy, speed and privacy,
without any significant sacrifice on any of them.

Moreover, we introduced the notion of personalized
differential privacy. That is, we defined (and proved) the
notion of obtaining estimates that respect different degrees
of privacy, as required by individual users. We believe that
this notion is highly relevant in today's information economy
where the expectation of privacy may be tempered by, e.g. the
cost of the service, the quality of the hardware (cheap
netbooks deployed with Windows 8.1 with Bing), and the extent
to which we want to incorporate the opinions of users.

Our implementation takes advantage of the caching
properties of modern microprocessors. By careful latency hiding
we are able to obtain near peak performance. In particular,
our implementation is approximately 3 times as fast as
GraphChi, the next-fastest recommender system. In sum,
this is a strong endorsement of Stochastic Gradient Langevin
Dynamics to obtain differentially private estimates in
recommender systems while still preserving good utility.

APPENDIX

Proof of Theorem 1. The $\epsilon$-DP claim follows by choosing
the utility function to be $-F(U, V)$ and applying the
exponential mechanism [19], which guarantees $\epsilon$-DP by
outputting $(U, V)$ with probability proportional to
$\exp\left(-\frac{\epsilon}{2 \Delta F(U, V)} F(U, V)\right)$,
where the sensitivity of a function $f$ is defined as

$$\Delta f(X) = \sup_{X, X' \in \mathcal{X}^n :\, d(X, X') \le 1} \| f(X) - f(X') \|_2 .$$

All we need to do is to work out the sensitivity of $F(U, V)$
here. By the constraint on U, V and $1 \le r_{ij} \le 5$, we know
$(r_{ij} - u_i^\top v_j)^2 \le (5 + \kappa)^2$. Since one user contributes only one
row to the data, the trimming/reweighting procedure ensures
that for any U, V and any user, the sensitivity of $F(U, V)$
obeys

$$\Delta F(U, V) \le 2 w_i \min\{m_i, r\} (5 + \kappa)^2 := 2B,$$

as specified in the algorithm. The $(\epsilon, \delta)$-DP claim is simple
(given in Proposition 3 of [33]) and we omit it here.

Lastly, we note that the "retry if fail" procedure will always
sample from the correct distribution of P conditioned on
$(U, V)$ satisfying our constraint that $u_i^\top v_j$ is bounded, and it
does not affect the relative probability ratio of any
measurable event in the support of this conditional distribution. □


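Proof of Theorem 2. For generality, we assume the parameter
vector is $\theta$ and all regularizers are captured in the prior $p(\theta)$.
The posterior distribution is

$$p(\theta \mid x_1, \ldots, x_n) = \frac{\prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)}{\int_\theta \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta}.$$

For any $x_1, \ldots, x_n$, if we add (removing has the same proof) a
particular user $x'$ whose log-likelihood is uniformly bounded
by $B'$, the probability ratio can be factorized into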

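$$\frac{p(\theta \mid x_1, \ldots, x_n, x')}{p(\theta \mid x_1, \ldots, x_n)} = \underbrace{\frac{p(x' \mid \theta) \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)}{\prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)}}_{\text{Factor 1}} \times \underbrace{\frac{\int_\theta \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta}{\int_\theta p(x' \mid \theta) \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta}}_{\text{Factor 2}}.$$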




















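It follows that

$$\text{Factor 1} = p(x' \mid \theta) = e^{\log p(x' \mid \theta)} \le e^{B'},$$

$$\begin{aligned}
\text{Factor 2} &= \frac{\int_\theta \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta}{\int_\theta p(x' \mid \theta) \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta} \\
&= \frac{\int_\theta \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta}{\int_\theta e^{\log p(x' \mid \theta)} \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta} \\
&\le \frac{\int_\theta \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta}{e^{-B'} \int_\theta \prod_{i=1}^{n} p(x_i \mid \theta)\, p(\theta)\, d\theta} \\
&\le e^{B'}.
\end{aligned}$$

As a result, the whole ratio is bounded by $e^{2B'}$.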

In Algorithm 1, denote $\theta = (U, V)$. We are sampling from
a distribution proportional to This is equivalent 

to taking the above posterior p to have the log-likelihood 
of User x' bounded by = f . B H , therefore the algorithm 
obeys personalized differential privacy for user x'. Take 
$B'$ to be any customized subset of $B_1, \ldots, B_n$ adjusted using
w and we get the expression as claimed. □
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